Advisory mission for the Boston municipal authorities¶

title

Welcome to the Decision Science exercise of your data certification exam!

Here are a few words to describe how the decision_science directory is organised:

In [1]:
# Start by running this cell to see the architecture of the directory
!tree
.
├── README.md
├── boston_crimes.ipynb
├── data
├── data.py
├── db
└── tests

3 directories, 3 files
  • the boston_crimes.ipynb notebook that you currently have under your eyes is the main document. You will find all the instructions here and except when it is explicitly specified, you should provide all your answers in this notebook;
  • the data and db folders will be filled-in throughout the exercise respectively with .csv datasets and a .sqlite file, for you to run all your analyses;
  • you will not have to interact with the assets folder for this exercise;
  • the tests folder will contain all the .pickle files that will be saved throughout the exercise with your key findings. Please run all the "Save your results" cells when completing the exercise!

⚠️ Important remark before you dive into the exercise. This notebook is quite long and it is easy to get lost in it: take full advantage of the collapsible headers and of the table of content. If you have not yet activated these Jupyter Notebook extensions, you may consider doing so now!

Imports¶

You can use this section to run your imports in a centralised manner throughout the exercise.

In [2]:
# Load the nbresult package to be able to save your results 
from nbresult import ChallengeResult
In [3]:
# Useful import for data collection
import sqlite3
In [4]:
# Useful imports for data manipulation and analysis
import numpy as np
import pandas as pd
In [5]:
# Useful imports for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
In [6]:
# Useful imports to estimate regression models
import statsmodels.formula.api as smf

1. Analysis for the mayor's team¶

During the last municipal campaign in Boston, criminality has been a major topic of debates. As citizens have expressed strong expectations from her on that front, the newly-elected mayor of Boston is looking for data-based insights on criminality in the Massachussetts capital. She has mandated your economics and urbanism consulting firm, The Locomotive, for this study.

1.1 Load the database¶

Download the boston_crimes.sqlite database from this URL and store it inside the db folder.

In [7]:
# You may directly run this cell to do so
!curl https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/boston_crimes.sqlite > db/boston_crimes.sqlite
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19.1M  100 19.1M    0     0   488k      0  0:00:40  0:00:40 --:--:-- 44505

1.2 Familiarize yourself with the database¶

This section aims at giving you a first overview of the database.

As you will see, it consists in three tables:

  • the incident_reports table has been provided by the Police Department of Boston. Each observation corresponds to a criminal incident that has required an intervention by the police in the municipality of Boston;
  • the districts table has been provided by the Urbanism Department of Boston. It gathers geographical information about the various police districts of Boston;
  • and the indicators table has been shared by the Economics Department of Boston, which keeps track of various indicators of the social and economic activity of Boston neighborhoods. Each observation corresponds to a police district.

More information about the different fields of these three tables can be found in the dedicated README.md file

Connect to the database¶

🔍 Using your preferred SQL client, connect to the database and browse through it as you wish to get acquainted with the data.

Draw a schema of the database¶

📝 Draw the database schema thanks to the schema editor on Kitt.

📝 Download the schema and save it as boston_crimes.xml in the db folder.

1.3 Extract the relevant dataset¶

Now that you have a good overview of the database, you can kick off the work! You will start with an SQL query to gather the relevant information.

Build the dataset¶

📝 Write the SQL query you need to fetch the data. Save it as a str in the query variable.

In [15]:
import sqlite3

conn = sqlite3.connect("/Users/jaouadahraoui/code/Jaouad-AHRAOUI/data-certification-exam/decision_science/db/boston_crimes.sqlite")
c = conn.cursor()

# TODO: write the query
query = """
SELECT indicators.CODE, NAME, COUNT(INCIDENT_NUMBER) AS NB_INCIDENTS, 
MEDIAN_AGE, TOTAL_POP, PERC_OF_30_34, PERC_MARRIED_COUPLE_FAMILY, PER_CAPITA_INCOME, PERC_OTHER_STATE_OR_ABROAD, 
PERC_LESS_THAN_HIGH_SCHOOL, PERC_COLLEGE_GRADUATES
FROM indicators JOIN districts ON indicators.CODE = districts.CODE JOIN incident_reports ON incident_reports.DISTRICT = districts.CODE
"""
# TODO: Execute the query
c.execute(query)
# TODO: Fetch and print the results
c.fetchall()
Out[15]:
[('D14', 'Brighton', 237221, 30.8, 55297, 52.8, 26.4, 41261, 8.6, 6.7, 10.5)]

📝 Store the output of the query in a DataFrame named crimes_df. Display the 5 first rows, as well as the shape of the dataset.

In [16]:
import pandas as pd
In [19]:
crimes_df = pd.read_sql_query(query, conn)

Save your results¶

You can run the following cell to save your results:

In [20]:
ChallengeResult('sql', query=query, data=crimes_df).write()

1.4 Linear regression - The socio-economic determinants of criminality¶

As mentioned above, we want to investigate the impact of the socio-economic characteristics of the different Boston police districts on the number of incidents that are reported in these areas.

  • We are going to use the number of incidents as dependent variable
  • our regressors will be the various socio-economic indicators extracted from the database.

1.4.1 Start from a fresh dataset¶

To make sure that you are using the right data, you can load a fresh dataset from this URL.

📝 Load the data into a DataFrame named data

In [22]:
import pandas as pd
data = pd.read_csv ('/Users/jaouadahraoui/Downloads/boston_crimes_regression.csv')
data.head()
Out[22]:
MEDIAN_AGE TOTAL_POP PERC_OF_30_34 PERC_MARRIED_COUPLE_FAMILY PER_CAPITA_INCOME PERC_OTHER_STATE_OR_ABROAD PERC_LESS_THAN_HIGH_SCHOOL PERC_COLLEGE_GRADUATES CODE NB_INCIDENTS NAME
0 30.8 55297 52.8 26.4 41261 8.6 6.7 10.5 D14 13788 Brighton
1 35.7 19890 28.2 36.4 75339 3.4 7.9 8.2 A15 4765 Charlestown
2 33.4 126909 28.2 26.6 29767 2.4 18.0 17.1 C11 32875 Dorchester
3 33.5 18306 32.5 35.8 80057 14.8 15.4 6.9 A1 26260 Downtown
4 30.6 47263 31.1 30.4 31473 3.5 27.2 11.5 A7 9691 East Boston

1.4.2 Run the regression and output its summary¶

Thanks to the Statsmodels Formula API, we will run the regression described below.

The dependent variable (or target variable) should be the total number of incidents reported in each police district.

We will focus on the following regressors:

  • the median age in the district, whose effect is difficult to anticipate on the number of crimes;
  • the percentage of 30-34 years old in the district, whose effect is also unclear a priori;
  • the share of families with a married couple among all households, which could be anticipated to have a negative effect on criminality (more attention to safety among residents...);
  • the percentage of residents having moved from abroad or from another US state over the last year, mobility being often associated with social marginalisation and possibly with a higher risk of resorting to illegal activities;
  • the percentage of residents having stopped their studies before getting a high school degree. Economic models would suggest that due to the more narrow job opportunities to which this group has access, the incentive is stronger to resort to illicit activities;
  • the percentage of college graduates in the district, which we would expect to have an opposite effect.

📝 Based on these indications, estimate the linear regression model and output its summary in this section of the notebook. Store the estimated model inside a model variable.

In [23]:
import statsmodels.formula.api as smf
import statsmodels.api as sm
In [24]:
# Create target variable y (Series) and feature X (DataFrame) without data leaks!
y = data["NB_INCIDENTS"].copy()

X = data[[
    "MEDIAN_AGE",
    "PERC_OF_30_34",
    "PERC_MARRIED_COUPLE_FAMILY",
    "PERC_OTHER_STATE_OR_ABROAD",
    "PERC_OTHER_STATE_OR_ABROAD",
    "PERC_COLLEGE_GRADUATES",
    ]].copy()
X
Out[24]:
MEDIAN_AGE PERC_OF_30_34 PERC_MARRIED_COUPLE_FAMILY PERC_OTHER_STATE_OR_ABROAD PERC_OTHER_STATE_OR_ABROAD PERC_COLLEGE_GRADUATES
0 30.8 52.8 26.4 8.6 8.6 10.5
1 35.7 28.2 36.4 3.4 3.4 8.2
2 33.4 28.2 26.6 2.4 2.4 17.1
3 33.5 32.5 35.8 14.8 14.8 6.9
4 30.6 31.1 30.4 3.5 3.5 11.5
5 39.4 21.1 38.4 1.9 1.9 21.3
6 34.8 32.5 33.7 5.5 5.5 12.1
7 36.7 20.9 29.8 2.3 2.3 22.9
8 32.5 27.8 17.8 2.9 2.9 18.9
9 31.9 46.1 24.7 2.4 2.4 8.4
10 37.1 33.9 28.3 6.2 6.2 8.5
11 42.8 18.1 46.2 1.9 1.9 13.2
In [25]:
model = sm.OLS(y,X).fit()
model.summary()
/Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/scipy/stats/stats.py:1603: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Out[25]:
OLS Regression Results
Dep. Variable: NB_INCIDENTS R-squared (uncentered): 0.974
Model: OLS Adj. R-squared (uncentered): 0.955
Method: Least Squares F-statistic: 52.40
Date: Tue, 29 Jun 2021 Prob (F-statistic): 2.14e-05
Time: 12:26:06 Log-Likelihood: -115.37
No. Observations: 12 AIC: 240.7
Df Residuals: 7 BIC: 243.2
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
MEDIAN_AGE 2563.0215 499.845 5.128 0.001 1381.077 3744.966
PERC_OF_30_34 -439.2307 161.790 -2.715 0.030 -821.802 -56.659
PERC_MARRIED_COUPLE_FAMILY -2160.0687 354.891 -6.087 0.000 -2999.253 -1320.884
PERC_OTHER_STATE_OR_ABROAD 955.8541 230.079 4.154 0.004 411.804 1499.904
PERC_OTHER_STATE_OR_ABROAD 955.8541 230.079 4.154 0.004 411.804 1499.904
PERC_COLLEGE_GRADUATES 177.4256 372.764 0.476 0.649 -704.022 1058.873
Omnibus: 7.823 Durbin-Watson: 1.947
Prob(Omnibus): 0.020 Jarque-Bera (JB): 3.721
Skew: 1.238 Prob(JB): 0.156
Kurtosis: 4.146 Cond. No. 1.64e+17


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.52e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Save your results¶

You can run the following cell to save your results:

In [26]:
ChallengeResult(
    'linear_regression_model',
    data=data,
    model=model
).write()

1.4.3 Statistical inference questions 🤯¶

You will now be asked a series of statistical inference and methodological questions about the regression model estimated above. Don't worry if results do not perfectly fit the "predictions" that we made earlier about the impact of each regressor: the goal was to form an a priori hypothesis, which is validated or invalidated through this empirical analysis.

Questions on the results¶

❓ Is the median age associated with a positive (increasing the number of crimes) or a negative (decreasing the target variable) effect on the number of crime incidents? Simply write your answer as a string below

In [32]:
answer_median_age = '''
Ceteris Paribus, l'augmentation d'une unité de l'âge median entraine une augmentation de 2563.0215 unités du nombre
d'accidents soit doit un effet positif sur le nombre de crimes. Cet effet est significatif au sens statistique (p-value < 5%)

'''

❓ What is the t-statistic associated with the median age regressor? How is it computed?

💡 Hint: We are looking at a ratio

In [33]:
answer_t_statistic = '''

la t-statistic est le resultat obtenu grâce au test de sgnificativité de Student. Ce résultat doit être comparé avec la p-value.
La valeur de p est souvent utilisée dans les tests d'hypothèses, tests qui vous permettent de rejeter, ou non, une hypothèse nulle. Elle représente la probabilité de faire une erreur de type 1, ou de rejeter l'hypothèse nulle si elle est vraie.

Plus la valeur de p est petite, plus la probabilité de faire une erreur en rejetant l'hypothèse nulle est faible. Une valeur limite de 0,05 est souvent utilisée. Autrement dit, vous pouvez rejeter l'hypothèse nulle si la valeur de p est inférieure à 0,05.
Dans notre cas , la variable age median a un effet significatif (p-value < 5%) sur notre target.
'''

📝 Recompute approximately the t-statistic based on the regression summary.

As it is a ratio $t = \frac{numerator}{denominator}$:

  • Store the numerator into a numerator variable
  • Store the denominator into a denominator variable
  • Store the t-statistic into a t_median_age variable
In [34]:
numerator = 2563.0215
denominator = 499.845
t_median_age = numerator / denominator
t_median_age
Out[34]:
5.127632566095489

❓ What is the p-value associated with the median age regressor? Store it in the pvalue_median_age variable (you may directly copy-paste it from the regression summary).

In [35]:
pvalue_median_age = 0.001

❓ What does this p-value mean for the median age? Is its effect statistically significant at the 95% confidence level? At the 90% confidence level? Simply formulate your answer in the cell below.

In [36]:
answer_p_value = """
La valeur de p est souvent utilisée dans les tests d'hypothèses (test de Student...) , tests qui vous permettent de rejeter, ou non, une hypothèse nulle. Elle représente la probabilité de faire une erreur de type 1, ou de rejeter l'hypothèse nulle si elle est vraie.
Plus la valeur de p est petite, plus la probabilité de faire une erreur en rejetant l'hypothèse nulle est faible. Une valeur limite de 0,05 est souvent utilisée. Autrement dit, vous pouvez rejeter l'hypothèse nulle si la valeur de p est inférieure à 0,05.
Dans notre cas , la variable age median a un effet significatif (p-value < 5%) sur notre target (95%)
"""

❓ What are the two regressors whose effect is statistically significant at the 95% confidence level in this regression model? Store the name of the variables as a list of strings in the significant_regressors variable.

In [37]:
significant_regressors = ["MEDIAN_AGE", "PERC_MARRIED_COUPLE_FAMILY"]

❓ Holding all other regressors constant, by how much does the total number of incidents increase or decrease when the share of families with a married couple increases by 1 percentage point in the district? Please formulate a full sentence with statistical rigor!

In [38]:
answer_how_much_increase = """
Ceteris Paribus, l'augmentation d'une unité de la variable PERC_MARRIED_COUPLE_FAMILY entraine une diminution de 2160.0687 unités du nombre
d'accidents soit doit un effet négatif sur le nombre de crimes. Cet effet est significatif au sens statistique (p-value < 5%)
"""

Limits of this regression model¶

You had asked the intern on the team to estimate a linear regression model so as to investigate the socio-economic determinants of crime in Boston. The results above are those that he presented. In the email he sent to you, he added:

You will probably notice the extremely high R-squared score of this model: I think we have an excellent fit and the results are solid 😄

But you have very strong doubts about this regression and you think it is a perfect occasion to give some very important advice to your intern...

❓ What is the main limitation of this (clearly spurious) regression according to you? This observation explains why we are getting a very high R-squared and large standard errors. Please provide your answer in the following Markdown cell.

In [42]:
answer_limitations = """
La principale limite de la régression linéaire est l'hypothèse de linéarité entre la variable dépendante et les variables indépendantes. 
Dans le monde réel, les données sont rarement séparables linéairement. 
Elle suppose qu'il existe une relation linéaire entre les variables dépendantes et indépendantes, ce qui est souvent incorrect.
De plus,statistiquement parlant, il faut avoir du recul sur le R2 obtenu même si il a l'air "bon".
Pour finir, un statisticien se doit de tester l'hypothèse d'HOMOSCEDASTICITÉ avant d'évaluer la signigicativité des différents coefficients
car en présence d'hétéroscedasticité, le résultat est des tests de significativité sont faussés.
"""

Save your results¶

You can run the following cell to save your results:

In [43]:
import json
answers_inference = {"MEDIAN_AGE": answer_median_age,
                    "T_STAT":answer_t_statistic,
                     "P_VALUE": answer_p_value,
                     "INCREASE": answer_how_much_increase,
                     "LIMITATIONS": answer_limitations}

with open("tests/answers_inference.json", "w", encoding="utf-8") as f:
    json.dump(answers_inference, f, ensure_ascii=False, indent=4)

ChallengeResult(
    'linear_regression_analysis',
    model=model,
    numerator=numerator,
    denominator=denominator,
    t=t_median_age,
    pvalue=pvalue_median_age,
    regressors=significant_regressors
).write()

2. Analysis for the police department¶

The head of the Police Department of Boston, who read your report for the Mayor's team, was extremely interested in the results. He contacted your consulting firm for an additional presentation, that would focus on the nature of crimes that take place in Boston, the potential time trends that you could identify and/or the heterogeneity of the impact of criminality on the different police districts.

2.1 Start with a fresh dataset¶

You will start from a fresh dataset, that corresponds more or less to the incident_reports table of the database.

In [44]:
# Run this cell to download the datasets in the data directory
!curl https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/incident_reports.csv > data/incident_reports.csv   
!curl https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/districts.csv > data/districts.csv    
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19.1M  100 19.1M    0     0   884k      0  0:00:22  0:00:22 --:--:--  997kk
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   498  100   498    0     0   1759      0 --:--:-- --:--:-- --:--:--  1759
In [45]:
# Load the dataset
from data import load_data_viz_data

data = load_data_viz_data()
data[['LAT','LONG']] = data[['LAT','LONG']].astype('float64')

print("Shape of the DataFrame:", data.shape)

data.head()
Shape of the DataFrame: (237221, 9)
Out[45]:
INCIDENT_NUMBER OFFENSE_CODE_GROUP SHOOTING OCCURRED_ON_DATE LAT LONG NAME LAT_POLICE_STATION LONG_POLICE_STATION
0 I192068249 Other 0 2015-08-28 10:20:00 42.330119 -71.084251 Roxbury 42.328894 -71.085359
1 I182074094 Violence and harassment 0 2015-09-14 09:31:00 42.315142 -71.067047 Roxbury 42.328894 -71.085359
2 I182054888 Violence and harassment 0 2015-07-12 15:37:00 42.312243 -71.075499 Roxbury 42.328894 -71.085359
3 I182054888 Other 0 2015-07-12 15:37:00 42.312243 -71.075499 Roxbury 42.328894 -71.085359
4 I182054888 Other 0 2015-07-12 15:37:00 42.312243 -71.075499 Roxbury 42.328894 -71.085359

2.1 Further data manipulations¶

In this section, we are going to answer two specific questions that the head of the Police Department of Boston asked your team, about the characteristics of the incidents that occur in the various districts of the municipality.

2.1.1 Most represented type of incident per district¶

First, the head of the Police Department of Boston wants to know what incident category is most often found in each district.

📝 Construct a DataFrame with

  • one row per police district, designated by its full name
  • one text column that indicates the name of the most common category of incident in the district over the whole sample period.

💡 Hint: you may need to first define a custom aggregation function.

In [47]:
!pip install pandasql
Collecting pandasql
  Downloading pandasql-0.7.3.tar.gz (26 kB)
Requirement already satisfied: numpy in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from pandasql) (1.18.5)
Requirement already satisfied: pandas in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from pandasql) (1.2.4)
Requirement already satisfied: sqlalchemy in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from pandasql) (1.4.17)
Requirement already satisfied: python-dateutil>=2.7.3 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from pandas->pandasql) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from pandas->pandasql) (2021.1)
Requirement already satisfied: six>=1.5 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->pandasql) (1.14.0)
Requirement already satisfied: greenlet!=0.4.17 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from sqlalchemy->pandasql) (1.1.0)
Building wheels for collected packages: pandasql
  Building wheel for pandasql (setup.py) ... done
  Created wheel for pandasql: filename=pandasql-0.7.3-py3-none-any.whl size=26818 sha256=1d6741d7a85322bc7812592ffabeeb91c01c48e2aec8e1c69c06e10821cbe3bd
  Stored in directory: /Users/jaouadahraoui/Library/Caches/pip/wheels/ed/8f/46/a383923333728744f01ba24adbd8e364f2cb9470a8b8e5b9ff
Successfully built pandasql
Installing collected packages: pandasql
Successfully installed pandasql-0.7.3
WARNING: You are using pip version 21.1.2; however, version 21.1.3 is available.
You should consider upgrading via the '/Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8 -m pip install --upgrade pip' command.
In [48]:
import pandasql as ps

q1 = """SELECT NAME, OFFENSE_CODE_GROUP  
        FROM data
        GROUP BY NAME
        ORDER BY COUNT(OFFENSE_CODE_GROUP)"""

print(ps.sqldf(q1, locals()))
             NAME              OFFENSE_CODE_GROUP
0     Charlestown           Larceny and vandalism
1    West Roxbury  Police investigation procedure
2     East Boston        Fraud and law violations
3       Hyde Park        Fraud and law violations
4   Jamaica Plain                           Other
5        Brighton        Fraud and law violations
6    South Boston        Fraud and law violations
7        Downtown        Fraud and law violations
8        Mattapan         Violence and harassment
9       South End        Fraud and law violations
10     Dorchester        Fraud and law violations
11        Roxbury                           Other

❓ Can you tell what is the second most common offense int the Brighton district?

In [193]:
second_most_common_offence = "Fraud and law violations"

Average distance to the police station per district¶

Second, based on the Haversine distance function defined below, the head of the Police Department would like to know, for each district, the average distance between the location of the incident and the police station.

📝 Construct a DataFrame with one row per police district, designated by its full name, and one column that displays this average Haversine distance.

In [49]:
# Haversine distance function
from math import radians, sin, cos, asin, sqrt

def haversine_distance(lon1, lat1, lon2, lat2):
    """
    Compute distance (km) between two pairs of (lat, lng) coordinates
    See - (https://en.wikipedia.org/wiki/Haversine_formula)
    """
    
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    
    return 2 * 6371 * asin(sqrt(a))
In [188]:
# YOUR CODE HERE
In [190]:
# YOUR CODE HERE

❓ Can you tell what is the average distance between the police station and the offenses in the Brighton district?

In [194]:
# YOUR CODE HERE
In [195]:
result = ChallengeResult('manipulation',
                         second_most_common_offence=second_most_common_offence,
                         average_distance_km=average_distance_km)
result.write()

3. Short presentation (REQUIRED TO VALIDATE THE CERTIFICATION)¶

🚨🚨🚨🚨🚨🚨

Boston crime Report¶

alt text

In [72]:
#Top 10 crime reported streets
import pandasql as ps

q1 = """SELECT NAME, COUNT(OFFENSE_CODE_GROUP) as Number_of_offense 
        FROM data
        GROUP BY NAME"""

top_crime_reported = ps.sqldf(q1, locals())
top_crime_reported
Out[72]:
NAME Number_of_offense
0 Brighton 13788
1 Charlestown 4765
2 Dorchester 32875
3 Downtown 26260
4 East Boston 9691
5 Hyde Park 12551
6 Jamaica Plain 12802
7 Mattapan 28331
8 Roxbury 38877
9 South Boston 16617
10 South End 31258
11 West Roxbury 9406

Top Crime reported district¶

In [73]:
import seaborn as sns
ax = sns.barplot(x="NAME", y="Number_of_offense", data = top_crime_reported)
ax.set(xlabel='Crime district', ylabel='# of Crimes reported')
ax.set_xticklabels(top_crime_reported['NAME'],rotation=90)
plt.show()
In [64]:
# Top 10 Offense types
import pandasql as ps

q1 = """SELECT OFFENSE_CODE_GROUP, COUNT(OFFENSE_CODE_GROUP) as Number_of_offense 
        FROM data
        GROUP BY OFFENSE_CODE_GROUP"""

top_offense_types = ps.sqldf(q1, locals())
top_offense_types
Out[64]:
OFFENSE_CODE_GROUP Number_of_offense
0 Disputes 15421
1 Drugs and disorderly conduct 19022
2 Fraud and law violations 26801
3 Larceny and vandalism 76038
4 Other 19399
5 Police investigation procedure 50520
6 Violence and harassment 30020

Top Offense types¶

In [65]:
import seaborn as sns
ax = sns.barplot(x="OFFENSE_CODE_GROUP", y="Number_of_offense", data = top_offense_types)
ax.set(xlabel='Crime Street', ylabel='# of Crimes reported')
ax.set_xticklabels(top_crime_reported['OFFENSE_CODE_GROUP'],rotation=90)
plt.show()
In [66]:
! pip install folium
Requirement already satisfied: folium in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (0.12.1)
Requirement already satisfied: requests in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from folium) (2.25.1)
Requirement already satisfied: branca>=0.3.0 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from folium) (0.4.2)
Requirement already satisfied: jinja2>=2.9 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from folium) (2.11.3)
Requirement already satisfied: numpy in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from folium) (1.18.5)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from jinja2>=2.9->folium) (1.1.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from requests->folium) (1.26.4)
Requirement already satisfied: idna<3,>=2.5 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from requests->folium) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from requests->folium) (2020.12.5)
Requirement already satisfied: chardet<5,>=3.0.2 in /Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages (from requests->folium) (4.0.0)
WARNING: You are using pip version 21.1.2; however, version 21.1.3 is available.
You should consider upgrading via the '/Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8 -m pip install --upgrade pip' command.

Crime map¶

In [70]:
# Create basic Folium crime map
import folium
from folium.plugins import HeatMap
crime_map = folium.Map(location=[42.3125,-71.0875], 
                       tiles = "Stamen Toner",
                      zoom_start = 11)

# Add data for heatmp 
# data_heatmap = data[data.Year == 2017]
data_heatmap = data[['LAT','LONG']]
data_heatmap = data.dropna(axis=0, subset=['LAT','LONG'])
data_heatmap = [[row['LAT'],row['LONG']] for index, row in data_heatmap.iterrows()]
HeatMap(data_heatmap, radius=10).add_to(crime_map)

# Plot!
crime_map
Out[70]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Conclusion¶

  • Parmis les crimes graves, nous avons constaté une prédominance des larcins et des vandalismes. Ainsi, il serait interessant de mettre en place une équipe en charge de ce type de crimes afin d'y apporter des solutions visant à réduire sa prédominance.

  • Aussi, nous avons constaté une prédominance des crimes graves dans certains districts et plus précisement dans les ditricts suivants: Roxbury, South End, Mattapan Ainsi, serait pertinent de plus de partouilles dans ces zones là afin de leur permettre d'intervenir en nombre et plus rapidement.

Cette EDA ne fait qu'effleurer la surface de l'ensemble des données:

  • D'autres analyses pourraient explorer la façon dont les différents types de crimes varient dans le temps et l'espace.
  • Une autre piste intéressante serait de combiner ces données avec d'autres données sur Boston, telles que la démographie ou même la météo, afin d'étudier les facteurs qui permettent de prédire les taux de criminalité dans le temps et l'espace.

Using the latest dataset that you loaded, your mission is now to prepare 5 slides (including a title slide) that you would present to the head of the Police Department. You may or may not, as you prefer, include the output of the two "Further data manipulations" tasks in your presentation.

⚠️ You can use any presentation editor of your choice, but the slides must be shared either in HTML or in PDF format and saved in the current directory

Before you get started, here are four small pieces of advice:

  • to prepare your data visualisations, do not hesitate to create a separate, blank notebook;
  • pay particular attention to the readability and the clarity of your legends, titles, charts and tables;
  • the Pandas resample method might be useful if you want to plot time trends;
  • keep in mind that you are working with real data and sometimes, data are not very talkative. Do not feel discouraged if your charts do not provide revolutionary insights: typically, an absence of trend is a substantial piece of information!

Good luck in changing Boston residents' lives!

A word of conclusion¶

Congratulations for going through the exercise 🎉

If you wish to pursue your analysis at some point, note that all datasets (and many others) are publicly available online, on the Analyze Boston website.

Besides, if you are interested in the topic, you can start by reading the work of Nobel Prize laureate Gary Becker, who was the first to model crime as a rational phenomenon, similarly to an economic decision. This model, although it has limitations, marked a breakthrough in the study of crime and paved the way for many empirical studies that further analysed the socio-economic determinants of illegal activities.

👉 Link to download a full-text version of "Crime and Punishment: An Economic Approach" by Becker (1968)